The goal of this project is to characterize the writing of three popular horror authors and to identify similarities and differences among their texts in the spooky dataset. The data consist of excerpts of texts written by Edgar Allan Poe (EAP), HP Lovecraft (HPL), and Mary Wollstonecraft Shelley (MWS).
packages.used <- c("ggplot2", "plotrix", "waffle", "dplyr", "tibble", "tidyr", "stringr", "tidytext", "topicmodels", "wordcloud", "plotly", "webshot", "htmlwidgets", "reshape2")
# Check which packages still need to be installed
packages.needed <- setdiff(packages.used, rownames(installed.packages()))
# Install any missing packages
if (length(packages.needed) > 0) {
  install.packages(packages.needed, dependencies = TRUE, repos = 'http://cran.us.r-project.org')
}
library(ggplot2)
library(dplyr)
library(tibble)
library(tidyr)
library(stringr)
library(tidytext)
library(topicmodels)
library(wordcloud)
library(plotrix)
library(waffle)
library(plotly)
library(webshot)
library(htmlwidgets)
library(reshape2)
The code below assumes spooky.csv is in the data folder and this Rmd is in the doc folder.
spooky <- read.csv('../data/spooky.csv', as.is = TRUE)
Take a look at the first few rows and the dimensions of the dataset.
head(spooky, 3)
## id
## 1 id26305
## 2 id17569
## 3 id11008
## text
## 1 This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.
## 2 It never once occurred to me that the fumbling might be a mere mistake.
## 3 In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.
## author
## 1 EAP
## 2 HPL
## 3 EAP
dim(spooky)
## [1] 19579 3
Quick check for missing values and variable types.
# Count missing (NA) values in the dataset
sum(is.na(spooky))
## [1] 0
# Convert `author` from character to factor for analysis
class(spooky$author)
## [1] "character"
spooky$author <- as.factor(spooky$author)
class(spooky$author)
## [1] "factor"
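The `sum(is.na(spooky))` check above counts only `NA` values; an empty or whitespace-only string in `text` would not be flagged. A minimal sketch of an extra check follows, run on a small made-up data frame (`toy` and its values are hypothetical, not taken from spooky.csv):

```r
# Toy data frame standing in for `spooky` (hypothetical values)
toy <- data.frame(
  id = c("id1", "id2", "id3"),
  text = c("A sentence.", "", "   "),
  author = c("EAP", "HPL", NA),
  stringsAsFactors = FALSE
)

# Number of NA cells, same idea as the check on `spooky`
na_count <- sum(is.na(toy))

# Number of text entries that are empty or contain only whitespace
blank_count <- sum(!is.na(toy$text) & trimws(toy$text) == "")

na_count
blank_count
```

The same two lines could be pointed at `spooky$text` if blank excerpts are a concern.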
How many texts does each author have in the dataset?
num_texts <- table(spooky$author)
num_texts
##
## EAP HPL MWS
## 7900 5635 6044
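As a quick sketch, the share of texts per author can also be computed with `prop.table()`. The counts below are copied by hand from the table output above into a named vector (`num_texts_demo` is a stand-in name so the example runs without the CSV):

```r
# Counts copied from the table output above
num_texts_demo <- c(EAP = 7900, HPL = 5635, MWS = 6044)

# Proportion of texts per author, expressed as a percentage
pct <- round(prop.table(num_texts_demo) * 100, 1)
pct
```

With the real data, `prop.table(num_texts)` would work the same way on the `table` object.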
Plot the composition of texts by author as a pie chart, displaying counts and percentages.
lbls <- paste(names(num_texts), '\n', num_texts, '\n', round(num_texts/sum(num_texts) * 100, 1), '%', sep = '')
pie3D(num_texts, labels = lbls, explode = 0.05, labelcex = 0.8)
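As an alternative view, the same counts could be drawn as a flat bar chart with ggplot2, which is already loaded above. This is only a sketch: the data frame `counts` is a hypothetical helper built by hand from the table output, not an object from the original analysis.

```r
library(ggplot2)

# Counts copied from the table output; `counts` is a hypothetical helper
counts <- data.frame(
  author = c("EAP", "HPL", "MWS"),
  n = c(7900, 5635, 6044)
)
counts$pct <- round(counts$n / sum(counts$n) * 100, 1)

# Bar chart with count and percentage printed above each bar
p <- ggplot(counts, aes(x = author, y = n)) +
  geom_col() +
  geom_text(aes(label = paste0(n, " (", pct, "%)")), vjust = -0.5) +
  labs(x = "Author", y = "Number of texts")
p
```

Bar heights share a common baseline, which can make the three authors easier to compare than angled pie slices.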